CSwR22 Group 2: Bivariate Smoothing

Isin Altinkaya, Dumitru Sebastian Pavel

22.09.22

Objective

Development process

  • Git and GitHub for version management and efficient collaboration
  • Conda environment for R package version management
  • Quarto for the slideshow
  • ggplotly (from plotly) for interactive ggplot2 plots
  • Data simulation for evaluating performance

Data simulation

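# Simulate noisy observations of `signal` on a regular grid.
# `noise` is called with the number of points, as in rnorm(n).
simulate_data <- function(from = 0, to = 20, step = 0.1, signal = sin, noise = rnorm) {
  x <- seq(from, to, by = step)
  data_y <- signal(x)               # noise-free signal
  y <- data_y + noise(length(x))    # one noise draw per grid point
  structure(list(df = data.frame(x = x, y = y), real = data_y), class = "simulation_data")
}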

# Example
simulate_data(from=0,to=5,step=1,signal=sin,noise=rnorm)
$df
  x           y
1 0 -0.17402541
2 1 -0.07753331
3 2  0.71228560
4 3  0.11696678
5 4 -1.01539638
6 5 -0.21664791

$real
[1]  0.0000000  0.8414710  0.9092974  0.1411200 -0.7568025 -0.9589243

attr(,"class")
[1] "simulation_data"

Simulated data: The sine wave

set.seed(42)
data <- simulate_data(from=0,to=20,step=0.1,signal=sin,noise=rnorm)

Benchmarking

Benchmarking: Fixed lambda

Benchmarking: Size of data

Simulate data from 0 to each of 20, 50, 100 and 200 with a step size of 0.1, where the signal is a sine wave and the noise is sampled from a normal distribution.
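A minimal sketch of how the four datasets could be generated, using base R only. The helper `make_sine_data` and the list name `datasets` are assumptions for illustration, not names from the project; the project's own `simulate_data()` would serve the same purpose.

```r
# Hypothetical helper: sine signal plus standard normal noise on [0, to].
make_sine_data <- function(to, step = 0.1) {
  x <- seq(0, to, by = step)
  data.frame(x = x, y = sin(x) + rnorm(length(x)))
}

set.seed(42)
upper_limits <- c(20, 50, 100, 200)

# One data frame per upper limit, named after the limit.
datasets <- lapply(upper_limits, make_sine_data)
names(datasets) <- paste0("to_", upper_limits)

# Number of observations in each dataset.
sizes <- sapply(datasets, nrow)
print(sizes)
```

Each dataset can then be passed to the function under test, so that run time can be compared against the number of observations.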

Small improvements

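# Hand-written first difference; assumes length(v) >= 2
diff_v2 <- function(v) {
  v[2:length(v)] - v[1:(length(v) - 1L)]
}

# Byte-compiled version of diff_v2
diff_v3 <- compiler::cmpfun(function(v) {
  v[2:length(v)] - v[1:(length(v) - 1L)]
})

# Byte-compiled, with length(v) computed only once
diff_v4 <- compiler::cmpfun(function(v) {
  l <- length(v)
  v[2:l] - v[1:(l - 1L)]
})

library(microbenchmark)

x <- rnorm(1000)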

all(diff(x)==diff_v2(x))
[1] TRUE
all(diff(x)==diff_v3(x))
[1] TRUE
all(diff(x)==diff_v4(x))
[1] TRUE
microbenchmark(times = 100, unit="ms",diff(x),diff_v2(x),diff_v3(x),diff_v4(x))
Unit: milliseconds
       expr      min        lq       mean    median        uq      max neval
    diff(x) 0.009804 0.0107580 0.01232056 0.0117375 0.0127200 0.040622   100
 diff_v2(x) 0.006211 0.0066965 0.04477809 0.0070665 0.0078245 3.750161   100
 diff_v3(x) 0.006158 0.0066655 0.00728467 0.0069920 0.0074780 0.024128   100
 diff_v4(x) 0.006071 0.0067175 0.00735203 0.0072380 0.0077205 0.011916   100
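# The three hand-written variants are roughly 40% faster than diff() at the median
# and barely differ from each other; we decided to use diff_v3.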
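One caveat that the benchmark does not cover: the hand-written variants index with `2:length(v)`, so for a vector of length 1 they return `c(NA, 0)` where `diff()` returns an empty vector. A minimal sketch of a variant that avoids this with negative indexing follows; `diff_neg` is a hypothetical name, it is not part of the project, and its speed was not measured here.

```r
# Hypothetical variant (not from the slides): drop the first element on the
# left and the last element on the right, then subtract.
diff_neg <- function(v) {
  v[-1L] - v[-length(v)]
}

x <- c(1, 4, 9, 16)

# Agrees with base diff() on ordinary input.
same_as_base <- isTRUE(all.equal(diff_neg(x), diff(x)))

# Short inputs give an empty result, as diff() does.
len_one <- length(diff_neg(5))
len_zero <- length(diff_neg(numeric(0)))
```

Whether the extra safety matters depends on whether the smoother can ever receive fewer than two points.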

Small improvements

Is it worthwhile to check whether the input is unsorted and sort only when needed?

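# Always sort
sort_v1 <- function(d) sort(d, method = "quick")
# Sort only if needed; return the input unchanged when it is already sorted
sort_v2 <- function(d) if (is.unsorted(d)) sort(d, method = "quick") else d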

d<-rnorm(1000)
sorted<-sort(d)
microbenchmark(sort_v1(d),sort_v1(sorted),sort_v2(d),sort_v2(sorted),times = 10)
Unit: microseconds
            expr    min     lq     mean  median     uq      max neval
      sort_v1(d) 67.683 71.823 190.4760 74.4905 77.795 1235.856    10
 sort_v1(sorted)  4.835  6.389   7.4398  6.8380  7.520   12.660    10
      sort_v2(d) 71.463 74.481 327.9451 83.2600 97.743 2406.094    10
 sort_v2(sorted)  1.213  1.418   2.0184  2.0100  2.555    3.018    10

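Conclusion: It depends on what the typical input looks like. The is.unsorted() check adds a small cost when the data are unsorted (median 83.3 vs 74.5 microseconds) and saves time when they are already sorted (median 2.0 vs 6.8 microseconds), so it pays off only if sorted input is common.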
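To make "depends on the data" concrete, the medians from the benchmark above can be turned into a rough break-even point: the share of already-sorted inputs above which the check is expected to win. This is only a back-of-the-envelope sketch; the medians come from 10 runs each and are noisy, and the variable names are assumptions for illustration.

```r
# Median timings in microseconds, copied from the benchmark output above.
t_always  <- c(unsorted = 74.4905, sorted = 6.8380)  # sort_v1
t_checked <- c(unsorted = 83.2600, sorted = 2.0100)  # sort_v2

# Extra time the check costs on unsorted input, and time it saves on sorted input.
extra_cost_unsorted <- t_checked[["unsorted"]] - t_always[["unsorted"]]
saving_sorted <- t_always[["sorted"]] - t_checked[["sorted"]]

# With a share p of sorted inputs, the check wins when
# p * saving_sorted > (1 - p) * extra_cost_unsorted.
break_even <- extra_cost_unsorted / (extra_cost_unsorted + saving_sorted)
print(round(break_even, 3))
```

With these particular medians the check would need roughly two thirds of the inputs to be sorted already before it pays off on average.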

Conclusions

  • Why? Fortran